{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Syntax Parser\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/syntax_parser.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1. [Predict using Python](#4.1-Predict-using-Python)\n",
    "\n",
    "    4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "\n",
    "5. [Customize the model](#5.-Customize-the-model)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "Syntactic parsing is the task of prediction of the syntactic tree given the tokenized (or raw) sentence.\n",
    "\n",
    "To define a tree, for each word one should know its syntactic head and the dependency label for the edge between them.\n",
    "For example, the tree above can be restored from the data\n",
    "\n",
    "```\n",
    "    1\tJohn    2\tnsubj\t\n",
    "    2\tbought  0\troot\t\n",
    "    3\ta       6\tdet\t\n",
    "    4\tvery    5\tadvmod\t\n",
    "    5\ttasty   6\tamod\t\n",
    "    6\tcake    2\tobj\n",
    "    7\t.       2\tpunct\n",
    "```\n",
    "Here the third column contains the positions of syntactic heads and the last one -- the dependency labels.\n",
    "The words are enumerated from 1 since 0 is the index of the artificial root of the tree, whose only\n",
    "dependent is the actual syntactic head of the sentence (usually a verb).\n",
    "\n",
    "Syntactic trees can be used in many information extraction tasks. For example, to detect who is the winner\n",
    "and who is the loser in the sentence *Manchester defeated Liverpool* one relies on the word order. However,\n",
    "many languages, such as Russian, Spanish and German, have relatively free word order, which means we need\n",
    "other cues. Note also that syntactic relations (`nsubj`, `obj` and so one) have clear semantic counterparts,\n",
    "which makes syntactic parsing an appealing preprocessing step for the semantic-oriented tasks.\n",
    "\n",
    "We use BERT as the lowest layer of our model (the embedder). To extract syntactic information we apply\n",
    "the biaffine network of [Dozat, Manning, 2017](https://arxiv.org/pdf/1611.01734.pdf).\n",
    "For each sentence of length `K` this network produces two outputs: the first is an array of shape ``K*(K+1)``,\n",
    "where `i`-th row is the probability distribution of the head of `i`-th word over the sentence elements.\n",
    "The 0-th element of this distribution is the probability of the word to be a root of the sentence.\n",
    "The second output of the network is of shape `K*D`, where `D` is the number of possible dependency labels.\n",
    "\n",
    "The easiest way to obtain a tree is simply to return the head with the highest probability\n",
    "for each word in the sentence. However, the graph obtained in such a way may fail to be a valid tree:\n",
    "it may either contain a cycle or have multiple nodes with head at position 0.\n",
    "Therefore we apply the well-known Chu-Liu-Edmonds algorithm for minimal spanning tree\n",
    "to return the optimal tree, using the open-source modification from [dependency_decoding package](https://pypi.org/project/ufal.chu-liu-edmonds/).\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before using the model make sure that all required packages are installed running the command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install syntax_ru_syntagrus_bert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Models list\n",
    "\n",
    "The table presents a list of all of the syntax parsing models available in the DeepPavlov Library.\n",
    "\n",
    "| Config | Description |\n",
    "| :--- | :--- |\n",
    "| morpho_syntax_parser/syntax_ru_syntagrus_bert.json | Config with the model which defines for each token in the sentence <br> its head and dependency type in the syntactic tree. |\n",
    "| morpho_syntax_parser/ru_syntagrus_joint_parsing | Config which unifies syntax parsing and morphological tagging. |\n",
    "\n",
    "The table presents comparison of syntax_ru_syntagrus_bert config with other models on UD2.3 dataset.\n",
    "\n",
    "| Model | UAS | LAS |\n",
    "| :--- | :---: | :---: |\n",
    "| [UD Pipe 2.3](http://ufal.mff.cuni.cz/udpipe) (Straka et al., 2017)  | 90.3 | 89.0 |\n",
    "| [UD Pipe Future](https://github.com/CoNLL-UD-2018/UDPipe-Future) (Straka, 2018) | 93.0 | 91.5 |\n",
    "| [UDify (multilingual BERT)](https://github.com/hyperparticle/udify) (Kondratyuk, 2018) | 94.8 | 93.1 |\n",
    "| Our BERT model (morpho_syntax_parser/syntax_ru_syntagrus_bert.json) | 94.9 | 93.4 |\n",
    "\n",
    "So our model is the state-of-the-art system for Russian syntactic parsing.\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "### Syntax Parser\n",
    "\n",
    "Our model produces the output in [CONLL-U format](http://universaldependencies.org/format.html)\n",
    "and is trained on Universal Dependency corpora, available on http://universaldependencies.org/format.html .\n",
    "The example usage for inference is"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model\n",
    "\n",
    "model = build_model(\"syntax_ru_syntagrus_bert\", download=True, install=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tЯ\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
      "2\tшёл\t_\t_\t_\t_\t0\troot\t_\t_\n",
      "3\tдомой\t_\t_\t_\t_\t2\tadvmod\t_\t_\n",
      "4\tпо\t_\t_\t_\t_\t6\tcase\t_\t_\n",
      "5\tнезнакомой\t_\t_\t_\t_\t6\tamod\t_\t_\n",
      "6\tулице\t_\t_\t_\t_\t2\tobl\t_\t_\n",
      "7\t.\t_\t_\t_\t_\t2\tpunct\t_\t_\n",
      "\n",
      "1\tДевушка\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
      "2\tпела\t_\t_\t_\t_\t0\troot\t_\t_\n",
      "3\tв\t_\t_\t_\t_\t5\tcase\t_\t_\n",
      "4\tцерковном\t_\t_\t_\t_\t5\tamod\t_\t_\n",
      "5\tхоре\t_\t_\t_\t_\t2\tobl\t_\t_\n",
      "6\t.\t_\t_\t_\t_\t2\tpunct\t_\t_\n"
     ]
    }
   ],
   "source": [
    "sentences = [\"Я шёл домой по незнакомой улице.\", \"Девушка пела в церковном хоре.\"]\n",
    "for parse in model(sentences):\n",
    "    print(parse, end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As prescribed by UD standards, our model writes the head information to the 7th column and the dependency\n",
    "information -- to the 8th. Our parser does not return morphological tags and even does not use them in\n",
    "training.\n",
    "\n",
    "### Joint Syntax Parser and Morphological tagger\n",
    "\n",
    "Our model in principle supports joint prediction of morphological tags and syntactic information, however, the quality of the joint model is slightly inferior to the separate ones. Therefore we release a special component that can combine the outputs of tagger and parser: `deeppavlov.models.syntax_parser.joint.JointTaggerParser`. Its sample output for the Russian language with default settings (see the configuration file `morpho_syntax_parser/ru_syntagrus_joint_parsing.json` for exact options) looks like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model\n",
    "\n",
    "model = build_model(\"ru_syntagrus_joint_parsing\", download=True, install=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tЯ\tя\tPRON\t_\tCase=Nom|Number=Sing|Person=1\t2\tnsubj\t_\t_\n",
      "2\tшёл\tшёл\tVERB\t_\tAspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n",
      "3\tдомой\tдомой\tADV\t_\tDegree=Pos\t2\tadvmod\t_\t_\n",
      "4\tпо\tпо\tADP\t_\t_\t6\tcase\t_\t_\n",
      "5\tнезнакомой\tнезнакомый\tADJ\t_\tCase=Dat|Degree=Pos|Gender=Fem|Number=Sing\t6\tamod\t_\t_\n",
      "6\tулице\tулица\tNOUN\t_\tAnimacy=Inan|Case=Dat|Gender=Fem|Number=Sing\t2\tobl\t_\t_\n",
      "7\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n",
      "1\tДевушка\tдевушка\tNOUN\t_\tAnimacy=Anim|Case=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_\n",
      "2\tпела\tпеть\tVERB\t_\tAspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n",
      "3\tв\tв\tADP\t_\t_\t5\tcase\t_\t_\n",
      "4\tцерковном\tцерковном\tADJ\t_\tCase=Loc|Degree=Pos|Gender=Masc|Number=Sing\t5\tamod\t_\t_\n",
      "5\tхоре\tхор\tNOUN\t_\tAnimacy=Inan|Case=Loc|Gender=Masc|Number=Sing\t2\tobl\t_\t_\n",
      "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
     ]
    }
   ],
   "source": [
    "sentences = [\"Я шёл домой по незнакомой улице.\", \"Девушка пела в церковном хоре.\"]\n",
    "for parse in model(sentences):\n",
    "    print(parse, end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the basic case the model outputs a human-readable string with parse data for each information. If you need\n",
    "to use the output in Python, consult the `deeppavlov.models.syntax_parser.joint.JointTaggerParser` and source code.\n",
    "\n",
    "## 4.2 Predict using CLI\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact syntax_ru_syntagrus_bert -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`-d` is an optional download key (alternative to `download=True` in Python code). The key `-d` is used to download the pre-trained model along with embeddings and all other files needed to run the model.\n",
    "\n",
    "# 5. Customize the model\n",
    "\n",
    "To train **syntax parser** on your own data, you should prepare a dataset in **CoNLL-U format**. The description of **CoNLL-U format** can be found [here](https://universaldependencies.org/format.html#conll-u-format).\n",
    "\n",
    "Then you should place files for training, validation and testing into the ``\"data_path\"`` directory of ``morphotagger_dataset_reader``, change file names in ``morphotagger_dataset_reader`` to your filenames and launch the training:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import train_model\n",
    "\n",
    "train_model(\"<your_syntax_parsing_config_name>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or **using CLI**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov train <your_syntax_parser_config_name>"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
